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Abstract 

The spontaneous breaking of chiral symmetry in QCD is known to be linked to a non-zero 
density of eigenvalues of the massless Dirac operator near the origin. Numerical studies of 
two-flavour QCD now suggest that the low quark modes are locally coherent to a certain 
extent. As a consequence, the modes can be simultaneously deflated, using local projectors, 
with a total computational effort proportional to the lattice volume (rather than its square) . 
Deflation has potentially many uses in lattice QCD. The technique is here worked out for 
the case of quark propagator calculations, where large speed-up factors and a flat scaling 
behaviour with respect to the quark mass are achieved. 



1. Introduction 

The physical masses of the up and down quarks are much smaller than the typical 
low-energy hadronic scales such as the pion decay constant and the string tension. In 
numerical lattice QCD, the smallness of the quark masses still is a source of difficulty, 
for various reasons, but mainly because the available simulation techniques become 
inefficient close to the chiral limit. 

It is not excluded, however, that many of the present limitations in lattice QCD 
can be overcome by "deflating QCD", i.e. by treating the eigenmodes of the Dirac 
operator with small eigenvalues separately from the bulk of the quark modes. Defla- 
tion techniques are used in many areas of applied science and they are also an active 
research topic in numerical mathematics (see refs. [1,2], for example, and references 
quoted there). In lattice QCD low- mode deflation was so far mainly used in connec- 
tion with statistical error reduction methods [3-7] that now go under the headings of 
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low-mode averaging and all-to-all propagators. Other applications of deflation meth- 
ods in QCD include quark propagator computations in special situations, where only 
a small number of modes need to be deflated [8-10]. 

In the large-volume regime of QCD, the low-mode deflation methods proposed to 
date however tend to become useless in practice, because the number of eigenvalues 
of the Dirac operator below any fixed value, say 100 MeV, grows proportionally to 
the four-dimensional volume V of the lattice. The computational effort required for 
the calculation of the low quark modes and the deflation operations scales like V 2 
in this situation (or even a higher power of V) and eventually offsets the benefits of 
low-mode deflation. As Banks and Casher [11] noted long ago, the average spectral 
density of the low quark modes is proportional to the quark condensate in the chiral 
limit. The F 2 -problem is thus directly linked to the spontaneous breaking of chiral 
symmetry and is therefore present independently of the chosen lattice formulation 
of the theory. 

At present little appears to be known about the space-time structure of the low 
quark modes, but a simple numerical inspection, reported in sect. 5, suggests that 
they are locally coherent to some extent. This property allows highly effective defla- 
tion subspaces to be built from only a few low modes, using block projectors. The 
numerical effort required for the preparation of the subspace and the deflation of 
the Dirac operator is then only of order V (rather than V 2 ). 

Before going into the details of the construction in sects. 4 and 5, the practical 
relevance of the y 2 -problem is briefly discussed in sect. 2 and it is explained, in 
sect. 3, how to deflate the Dirac operator if the deflation subspace is not spanned by 
exact eigenmodes of the operator. The potential of the proposed deflation method is 
demonstrated in sect. 6, where a preconditioned solver for the lattice Dirac equation 
is described, whose efficiency decreases only slightly with the quark mass and which 
outperforms any solver previously used in lattice QCD by a large factor. 



2. Spectral density and the V 2 problem 

2. 1 Lattice parameters & field ensembles 

All simulation results reported in this paper were obtained using the 0(a)-improved 
Wilson formulation of lattice QCD [12,13] with two flavours of mass-degenerate sea 
quarks. Only two lattices, of size 48x 24 3 and 64x32 3 , were considered, both at the 
same inverse gauge coupling j3 = 5.3, sea-quark hopping parameter K sea = 0.13625 
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and value c sw = 1.90952 [14] of the Sheikholeslami-Wohlert improvement term. At 
this point in parameter space, the lattice spacing a in physical units is estimated to 
be 0.0784(10) fm [15], while the sea-quark mass is roughly equal to a quarter of the 
physical strange-quark mass m s . 

Representative ensembles of gauge-field configurations on these two lattices were 
generated by the authors of ref. [15] and were made available for the studies con- 
ducted here. The ensembles consist of 169 and 50 configurations, widely separated 
in simulation time so that the residual autocorrelations can, in most cases, be ex- 
pected to be negligible. However, the discussion that follows is intended to be largely 
qualitative and the quoted errors and any systematic uncertainties will therefore be 
generously ignored. 

2. 2 Computation of the spectral density 

In the Wilson theory, the spectrum of the (massive) lattice Dirac operator D is sup- 
ported in an elliptic region in the complex plane and is thus not easily compared with 
the spectrum of the Dirac operator in the continuum theory and the Banks-Casher 
formula. This difficulty can be bypassed by considering the hermitian operator D'D 
instead of D, a choice which has other advantages as well. The computation of the 
low-lying eigenvalues of the operator, for example, becomes relatively straightfor- 
ward. In this paper all eigenvalue and eigenmode calculations were performed using 
Chebyshev-accelerated subspace iterations (see appendix A of ref. [16]). 

The spectral density of (D^D) 1 / 2 , averaged over the ensemble of gauge-field con- 
figurations on the 48 x 24 3 lattice, is shown in fig. 1. Perhaps the most interesting 
feature of this distribution is that it is practically constant above the threshold re- 
gion at the low end of the spectrum. The threshold of the density in infinite volume 
is, incidentally, expected to be at Zpjn sea . [16], where Za and m sea denote the axial 
current renormalization constant and the bare current-quark mass of the sea quark 
(Z A = 0.75(1) [17] and om sea = 0.00761(7) [15] on the lattices considered here). As 
can be seen from the figure, this value appears to give a good indication on where 
the bulk of the spectrum in finite volume begins. 

As discussed in ref. [16], the spectral density of (D^ D) 1 / 2 renormalizes multiplica- 
tively, the renormalization factor Zp being the same as the one of the pseudo-scalar 
density. For the specified lattice parameters, the conversion factor from the lattice 
to the MS scheme of dimensional regularization at renormalization scale fi = 2 GeV 
was recently determined to be Zp 1 = 1.84(3) [20]. The range of eigenvalues in fig. 1 
thus extends up to about 121 MeV after conversion to the MS scheme, i.e. to a value 
approximately 25% larger than the physical mass of the strange quark [18-20]. 

The spectral density on the 64x32 3 lattice was also computed and turned out to be 
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Fig. 1. Unrenormalized density p(A) of the eigenvalues A of (D* D) 1 ^ 2 on the 48x24 3 
lattice, in units of 'number of eigenvalues per MeV and fm 4 '. The lattice parameters 
are as specified in subsect. 2.1 and the dotted vertical line indicates the theoretically 
expected position of the threshold of the density in infinite volume [16]. 

nearly the same as the one on the 48 x 24 3 lattice. In particular, the average number 
of eigenmodes in the range covered by fig. 1 increases from 29 on the smaller lattice 
to about 89 on the big lattice, which shows that the y 2 -problem is not an academic 
one. The computation of the 32 lowest eigenvalues and associated eigenmodes of 
D'D on the 48 x 24 3 lattice, for example, to a relative precision of 10 -3 , is in fact 
already a heavy task that requires the Dirac operator to be applied some 2.5 x 10 5 
times. 



2.3 Comparison with the Banks-Casher formula 

According to the Banks-Casher relation [11], the average number n(M) of eigenval- 
ues of the massless Dirac operator of magnitude less than M is, in the continuum 
theory, given by 

n(M) = -MZV + 0(M 2 ), (2.1) 

7T 

where S denotes the «-quark condensate in the thermodynamic limit. This formula 
holds in any renormalization scheme, but £ must refer to a definite normalization 
prescription. A recently quoted result in two-flavour QCD for the condensate in the 
MS scheme is £ = (251 ± 13 MeV) 3 [21]. Setting M = 100 MeV for illustration, and 
assuming a 2L x L 3 lattice, eq. (2.1) then yields the estimates n{M) = 21, 108 and 
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342 for the average number of quark modes below M at L = 2, 3 and 4 fm. 

These figures are in a similar range as the numerically determined mode numbers 
reported in the previous subsection. A quantitative comparison must however take 
into account the exact physical sizes of the simulated lattices and the fact that the 
lattice Dirac operator D includes the quark mass term. Inserting again the value of 
£ quoted above and converting the lattice sizes to physical units (using a = 0.0784 
fm), the Banks-Casher formula then predicts the average number of eigenvalues in 
the range covered by fig. 1 to be 20 and 63, respectively, on the 48 x 24 3 and the 
64 x 32 3 lattice. These values are smaller than the actual numbers (29 and 89) of 
low modes on these lattices, but they are in the same ballpark and one should also 
not forget that there are systematic uncertainties in these calculations. 



3. Inexact deflation 

It should be quite clear at this point that good deflation methods in QCD should not 
assume the low eigenmodes of the Dirac operator to be accurately known. Eventually 
the only requirements are that the method is efficient and that the correctness of 
the final results is guaranteed. Inexact deflation was already discussed in ref. [10], 
for example, and will be driven to the extreme in this paper, partly following recent 
developments in the mathematical literature [1,2]. 

3.1 Oblique projector algebra 

Deflation methods in QCD usually start from a set of quark fields, 4>x(x), . . . , 0at(x), 
which will here be assumed to be orthonormal but are otherwise left unspecified t. 
The orthogonal projector P to the space S spanned by these fields (the deflation 
subspace) acts on a given quark field ip(x) according to 



where the bracket (%, tp) denotes the obvious scalar product in the linear space of 
all quark fields. 

f The term quark field is reserved for lattice Dirac fields that carry a colour but no flavour index. 
The eigenmodes of D with small eigenvalues are referred to as the low quark modes or, somewhat 
abusively, as the low modes of the Dirac operator. 
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(3.1) 



k=i 
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The restriction of the lattice Dirac operator D to the deflation subspace is referred 
to as the little Dirac operator. It is completely specified by the matrix 

A kl = {<t> k ,D<i>i), k,l = l,...,N, (3.2) 

that represents its action on the basis fields. In the following, the little Dirac operator 
is assumed to be invertible, a requirement that will always be satisfied in practice. 
The linear operators 

N 

P L ^(x) = i>{x) - DfaWiA-^u {<j> u Y>) , (3.3) 
k,i=i 

N 

P R ^(x) = Mx) - <M^~% (<Pi,D*P) , (3.4) 
k,i=i 

can then be defined, where the subscripts stand for "left" and "right" because Pl 
and P R usually appear on the left and right of the Dirac operator. These operators 
are oblique projectors, i.e. they are not hermitian but satisfy 

P 2 L = P L , Pl = P R . (3.5) 

Other algebraic identities that follow directly from the definitions (3.1)-(3.4) are 

P L D = DP R , (3.6) 
PP L = P R P = 0, (3.7) 
P L (1-P) = (l-P)P R = 1-P (3.8) 

In particular, Pl projects to the orthogonal complement of the deflation subspace. 

3.2 Deflation of the Dirac equation 
The inhomogeneous Dirac equation, 

Di/j(x) = r](x), (3.9) 

may now be split into two independent equations by acting with the projectors Pl 
and 1 — Pl from the left. The second equation can be solved immediately and the 
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solution of the full system is then given by 
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Y>(x) = X (x) + M^iA-'hi ??) , (3.10) 

k,l=l 

where x( x ) must solve the deflated system 

P L D X (x) = P LV (x) (3.11) 

subject to the constraint (1— Pr)x{x) = 0. In view of the commutator property (3.6), 
this constraint is consistent with the deflated system and can be freely imposed. One 
may actually solve the deflated equation (3.11) without imposing any constraint and 
simply apply Pr to the calculated solution at the end of the computation. 
The full quark propagator S(x, y) can be similarly split into two parts, 

N 

S(x,y) = P R S(x,y) + £ (j) k (x)(A~ 1 ) k i(f>i(y)^ , (3.12) 
k,i=i 

the second term being the contribution along the deflation subspace while the first 
coincides with the Green function of the deflated system (3.11). In practice eq. (3.12) 
may be a starting point for the application of variance reduction methods such as 
those described in refs. [4,7]. 

3.2 Deflation efficiency 

Some insight into why deflation is potentially beneficial is obtained by noting that 
the deflated operator 

D = P L D = P L D(1 - P) (3.13) 

acts in the orthogonal complement S 1 - of the deflation subspace. Moreover, a little 
algebra shows that D is the Schur complement of D with respect to S and that its 
inverse in S 1 - is given by 

b~ l = (1 - P)D~ 1 (1 - P). (3.14) 



The condition number of the deflated system (3.11) is thus expected to be signifi- 
cantly smaller than the condition number of the full system if the low modes of the 
Dirac operator are sufficiently suppressed by the projector 1 — P. 



7 



For any given normalized quark field ip(x), the deficit 



e= 11(1 



(3.15) 



provides a practical measure of how well the field is approximated by the deflation 
subspace. Useful subspaces will have to be such that all low quark modes (in, say, 
the range considered in sect. 2) have small deficits e. However, contrary to what may 
be presumed, the construction of such subspaces does not require the low modes to 
be computed to any accuracy (see sect. 5). 



4. Domain-decomposed subspaces 

The deflation subspaces considered in the following are based on a division of the 
lattice into non-overlapping rectangular blocks of lattice points. Domain decompo- 
sitions of this kind were previously introduced for the Schwarz preconditioning of 
the Dirac operator and the HMC algorithm [22,23], but the subspaces constructed 
in this paper are not linked to the Schwarz preconditioning and can be used in many 
different ways. 

4-1 Block projection method 

Once the lattice is divided into blocks, local deflation subspaces may be defined by 
specifying N s orthonormal quark fields <f>i(x), 1 = 1,... ,N S , on each block A. The 
full deflation subspace is then spanned by the set of all these local subspaces and thus 
has dimension N = N^N S , where Nf, denotes the number of blocks in the lattice. In 
particular, at fixed block size, the total number of basis fields scales proportionally 
to the lattice volume V. 

Subspaces of this kind fit the general framework discussed in the previous section 
if the basis fields are relabelled by an index k running from 1 to N. The little Dirac 
operator, the deflation projectors and the deflated Dirac operator are thus defined 
as before. An obvious advantage of the construction is that the application of the 
projector P to a given quark field tp(x), 




(4.1) 



A 1=1 
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requires a number of arithmetic operations proportional to the lattice volume times 
N s (rather than N). From the point of view of the operations count and the memory 
requirements, the subspace thus behaves as if it were spanned by only N s fields. A 
notable exception to this rule is the little Dirac operator, which always acts in a 
space of dimension N. 

4-2 Building domain-decomposed subspaces from global fields 

In practice the block fields (f)f{x), I = 1, . . . , N s , will be obtained starting from a set 
of N s globally defined quark fields ipi(x). The procedure is very simple and begins 
by projecting the input fields to the blocks, i.e. by defining the fields 

. [ ibAx) if x G A, 

= , (4.2) 

[ otherwise, 

on each block A. The Gram-Schmidt orthonormalization process is then applied to 
these and the orthonormalized fields are taken to be the basis elements (j)^(x). 

The subspace generated in this way contains the fields ipi(x), but since the number 
of basis fields is multiplied by the number of blocks, the subspace tends to be much 
larger than the space spanned by the input fields. 

4-3 Deflation of the free-quark theory 

For illustration and in order to motivate the further developments, it is now helpful 
to briefly consider the case of the free-quark theory. As will become clear below, 
a good choice of the basis fields in this theory are the constant modes. Since the 
quark fields carry a Dirac and a colour index, one has N s = 12 orthonormal constant 
modes on each block. 

If periodic or anti-periodic boundary conditions are imposed, the eigenmodes of 
the Dirac operator are plane waves of the form 

iP p (x) = u p e ipx , (4.3) 

where u p is a spinor that depends on the momentum p but not on the position x. 
Assuming an L 4 lattice and a block division into blocks of size 6 4 (where L is an 
integer multiple of b) , a straightforward computation then shows that 

||(1 - P)^|| 2 = e p ||Vg| 2 , e p = ^p 2 (b 2 - a 2 ) + 0(p% 4 ). (4.4) 

The projection to the orthogonal complement of the specified deflation subspace thus 
suppresses the low-momentum modes by a factor proportional to p 2 (see fig. 2). 
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Fig. 2. Approximation of a plane wave by a superposition of constant block modes. 
In the free-quark theory, piecewise constant deflation modes achieve high deflation 
efficiencies up to momenta p on the order of the inverse of the block size b. 

A second and perhaps more important observation is that the deflation efficiency 
does not depend on the lattice size. Even on very large lattices, all low modes with 
momenta p of magnitude up to some fraction of 1/6 are deflated with small deficits 
e p . Figure 2 also illustrates the fact that high deflation efficiencies can be achieved 
by subspaces of fields that are only piecewise smooth, i.e. fields that are far from 
being approximate eigenmodes of the Dirac operator. 



5. Local coherence and subspace generation 

The discussion in the previous section suggests that the l/ 2 -problem can perhaps 
be solved using domain-decomposed deflation subspaces. However, no general pre- 
scription was given so far of how to choose the fields vpi(x), I = 1,...,N S , from 
which these subspaces are built (cf. subsect. 4.2). Such a prescription will now be 
developed, based on a property of the low quark modes referred to as local coherence. 

5.1 Smoothness & local coherence 

In the free-quark theory, the block projection method works out because the low- 
momentum modes are smooth on the scale of the block size b. The intuitive picture 
that goes along with this explanation is rather appealing but may be difficult to 
carry over to the full theory. In particular, the notion of smoothness ceases to have 
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a well-defined meaning in presence of a non-trivial lattice gauge field. 

A related concept which is better adapted to the situation in the full theory is local 
coherence. Loosely speaking, a set of quark fields is referred to as locally coherent if 
the fields are locally well approximated by a relatively small number of fields. When 
projected to the blocks of a block lattice, for example, such fields are contained in 
small subspaces of block fields, up to small deficits that depend on the block size 
and the dimension of the local subspaces. 

It is quite clear that the block projection method can only work out if the low quark 
modes are locally coherent in this sense. Whether this is so appears to be difficult to 
tell on the basis of simple reasoning alone. The free-quark theory certainly provides 
little guidance at this point, because the physics of the low modes is completely 
different from the one in the full theory. 

5.2 Numerical experiments 

Local coherence is a property that can be investigated numerically in a straightfor- 
ward manner. One begins with an accurate computation of the low-lying eigenvalues 
and associated eigenmodes of D and constructs a domain-decomposed subspace 
from an arbitrary subset of the calculated modes, following the lines of subsect. 4.2. 
The question is then whether all other low modes are also contained in this subspace, 
up to small deficits e. 

Several numerical experiments of this kind were performed in two-flavour QCD on 
the lattices specified in subsect. 2.1. The results are quite impressive and unambigu- 
ously show that the low modes in this theory are locally coherent to a high degree. 
Moreover, the property appears to hold for every individual gauge-field configuration 
and not just on average. 

If the 64 x 32 3 lattice is divided into blocks of size 4 4 , for example, and if 12 
eigenmodes out of 48 are selected for the construction of the domain-decomposed 
subspace, the remaining 36 modes turn out to lie in the subspace up to deficits e 
ranging from 0.03 to 0.06. The deficits increase with the block size, but become 
smaller if more modes are used for the subspace construction. On the 48 x 24 3 
lattice the situation is practically the same, i.e. similar deficits are obtained for a 
given block size and subspace dimension. 

5.3 Subspace generation 

As explained in subsect. 4.2, the deflation subspaces constructed in this paper are 
obtained by restricting a set of quark fields ipi(x), I = 1, ... ,iV a , to the blocks of 
a block division of the lattice. The fields could be taken to be low eigenmodes of 
the Dirac operator, but it is far more economical to generate them by a relaxation 
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process, starting from a set of random fields. 

A relaxation method that can be used in this context is inverse iteration, where 
the fields are updated a number of times according to 

^(x)- u D- ln ^{x), l = l,...,N a . (5.1) 

The inverse of the Dirac operator is put in quotes in this formula, because an accurate 
solution of the Dirac equation is not required. The application of a few cycles of 
the Schwarz alternating procedure [22], for example, actually already has the desired 
relaxation effect. Moreover, the procedure can be bootstrapped by using the current 
set of fields to deflate the Dirac equation and thus to accelerate the approximate 
solution of the equation in the next step (see sect. 6). 

Inverse iteration rapidly depletes the components of the fields parallel to the high 
modes of the Dirac operator. After a few cycles, the fields then satisfy the bound 

PV/II <M||^||, l = l,...,N s . (5.2) 

for some value of M in the range of the low eigenvalues of (D^ D) 1 / 2 . 

An important remark is now that such fields are, to a good approximation, linear 
combinations of the low quark modes. They are therefore locally coherent with these 
and consequently generate domain-decomposed subspaces that approximate the low 
modes up to small deficits. Some further experimenting actually confirms this and 
also shows that the deflation efficiencies are not very different from those achieved 
by domain-decomposed subspaces built from exact low modes. 

5.4 Choice of parameters 

The deflation efficiency of the subspaces generated in this way depends on the block 
size, the dimension N s of the local subspaces and on the number and quality of 
inverse iteration steps that were applied. Choosing small blocks and large numbers 
N s of fields results in high deflation efficiencies but tends to increase the condition 
number of the little Dirac operator and thus the computer time required for the 
application of the oblique projectors Pl and Pr. Similarly, the beneficial effects of 
high numbers of fields and inverse iteration steps must be balanced against the effort 
spent for the subspace generation. 

On the lattices specified in subsect. 2.1, choosing blocks of size 4 4 and setting 
N s = 20 turns out to be a good compromise. Highly efficient deflation subspaces are 
obtained in this case if the relaxation procedure is stopped when the bound (5.2) is 
satisfied for a value of M in the MS scheme equal to 100 MeV or so (cf. sect. 2). 
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This level is reached after 11 inverse iteration steps and requires a computational 
effort equivalent to about 190 applications of the Dirac operator per field (if slightly- 
less effective deflation subspaces are acceptable, one can do with 8 steps and 130 
applications) . 

In general the parameters will have to be tuned empirically. If a deflated solver 
program like the one described in the next section is available, the deflation efficiency 
of a given subspace can be quickly determined by measuring the time required for the 
solution of the Dirac equation to a specified accuracy. Computations of the low quark 
modes are then again not required. The inverse iteration steps can, incidentally, be 
carried out at a valence quark mass different from the sea-quark mass. For reasons 
of efficiency, it is in fact recommended to set the bare mass in this process to a value 
close to (or even equal to) the critical mass. 



6. Deflation-accelerated solver for the Dirac equation 

Low-mode deflation is expected to be useful in several areas of lattice QCD, some 
of which [3-10] were already mentioned in sect. 1. The principal goal in this section 
is to show, in a concrete case, that the deflation subspaces constructed following the 
prescriptions given in the previous section are very effective and that they actually 
do provide a solution to the F 2 -problem. 

6.1 Preconditioned Dirac equation 

Once the deflation subspace is generated, the deflated Dirac equation (3.11) can be 
solved straightforwardly using any of the well-known Krylov space algorithms (see 
ref. [24], for example). However, from the point of view of the execution time, such 
a solver may not perform too well, because the little system 



must be solved, for one source vector w = (w\, . . . , wn), each time the projector Pl 
is applied. As explained in appendix A, there are efficient algorithms to solve the 
little Dirac equation, but the computational effort remains non-negligible. 

A better balance of deflation and other operations can be achieved by right-pre- 
conditioning the deflated equation. The solver discussed in the following includes the 
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(6.1) 
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Schwarz preconditioner M sap introduced in ref. [22], but a polynomial preconditioner 
or a fixed number of GMRES iterations [25,26] may do just as well. In the case of 
the Schwarz preconditioner, the preconditioned deflated equation reads 

P L DM seLp( f>(x) = P L r,(x) (6.2) 

and the solution of eq. (3.11) is then given by x( x ) = PrM sb , p <)>(x). The important 
point to note is that the preconditioning reduces the iteration count of the Krylov 
space algorithm and thus the overhead generated by the deflation projector. 

6.2 Krylov space solver and the deflation-relaxation interplay 

Both the Schwarz preconditioner and the deflation projector involve approximate 
iterative procedures. The GCR algorithm is a recommended Krylov space solver in 
this situation, because it allows for inexact preconditioning without compromising 
the correctness of the solution (see ref. [24] for a general discussion and ref. [22] for 
a description of the algorithm in the context of lattice QCD). 

An interesting feature of the GCR algorithm is that the Krylov space is extended, 
in each step, in a direction = M sap/ o(x) where p(x) denotes the current residue. 
The latter satisfies Plp(x) = p(x) by construction and is therefore orthogonal to the 
deflation subspace (cf. sect. 3). When acting on such a field, the alternating Schwarz 
procedure (which is basically a relaxation method) tends to be quite effective in 
producing an approximate solution of the Dirac equation D£(x) = p(x). Low-mode 
deflation thus has the effect of improving the efficiency of the preconditioner. 

Once £(x) is calculated, the minimal residue in the so extended Krylov space is 
determined by computing PlD^(x) and by applying an orthogonalization process. 
There is thus an interplay between deflation and relaxation, where the low-mode and 
the high-mode components of the residue are reduced in alternation by the deflation 
projector and the Schwarz preconditioner. 

6.3 Performance tests 

The performance of the complete algorithm (DFL+SAP+GCR for short) was deter- 
mined on the lattices specified in subsect. 2.1, at the values of the (valence) quark 
mass that correspond to the hopping parameters K va i listed in table 1. In this range 
of masses, the bare current-quark mass m va i decreases from the strange-quark mass 
m s to approximately \m s [15], where the condition number of the Dirac operator 
reaches a value of about 1900. 

In order for the effects of low-mode deflation to be clearly seen, the performance 
measurements were extended to the even-odd preconditioned BiCGstab algorithm 
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Table 1. Average solver iteration numbers Nx and executing times t* 



EO+BiCGstab SAP+GCR DFL+SAP+GCR 



Lattice K va i N BiC G t [sec] A^gcr t [sec] N GC r t [sec] 



48 x 24 3 


0.13550 


314 


57 


50 


35 


17 


15 




0.13590 


492 


90 


78 


56 


19 


18 




0.13610 


684 


125 


110 


78 


20 


19 




0.13625 


954 


174 


157 


118 


21 


21 




0.13635 


1269 


231 


227 


170 


22 


22 


64 x 32 3 


0.13550 


323 


72 


52 


45 


17 


20 




0.13590 


520 


115 


83 


71 


20 


23 




0.13610 


748 


165 


120 


103 


21 


26 




0.13625 


1125 


248 


183 


171 


23 


29 




0.13635 


1663 


366 


294 


267 


25 


32 



* Using 24 and 64 processors, respectively, in the case of the 48 X 24 3 and the 64 x 32 3 lattice 



(EO+BiCGstab) [27,28] and the Schwarz-preconditioned GCR algorithm without 
deflation (SAP+GCR) [22]. In all cases, the tests consisted in measuring the solver 
iteration numbers and the computer time required for the solution of the full Dirac 
equation (3.9) to a precision where \\r] — Dip\\ < 10 _10 ||ry||. Timings were taken 
on a recent PC cluster with Infmiband network, using 12 and 32 double-processor 
nodes for the tests on the 48 x 24 3 and the 64 x 32 3 lattice respectively. Only highly 
optimized, parallel programs were used that include machine-specific enhancements 
such as those mentioned in ref. [22]. Quoted solver iteration numbers and timings 
are averages over 50 gauge- field configurations. 

The algorithmic parameters were set to the same values on the 48 x 24 3 and 
the 64 x 32 3 lattice. In particular, the deflation subspaces were constructed by 
applying 11 inverse iteration steps to N s = 20 random fields and by projecting 
them to a division of the lattice into blocks of size 4 4 . In the case of the Schwarz 
preconditioner, the block size was taken to be 8 x 4 3 and all other parameters were 
set to the standard values previously used in refs. [22,23,15]. A fairly small value, 
N^ v = 16, turned out to be a satisfactory choice for the maximal number N^ v of 
Krylov vectors that may be generated before the GCR algorithm is restarted (larger 
values, up to N^ v = 32, had to be used in the case of the SAP+GCR solver). 
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Fig. 3. Average execution time t needed for the solution of the lattice Dirac equation 
on the 64 x 32 3 lattice as a function of the bare valence quark mass m va i given in units 
of the lattice spacing a. The lattice, algorithm and test parameters are as specified in 
subsects. 2.1 and 6.3. Dotted lines are drawn to guide the eye. 



As is evident from the test results quoted in table 1, low-mode deflation signifi- 
cantly reduces both the solver iteration numbers and the time needed to solve the 
Dirac equation to a specified precision. Particularly impressive is the fact, illustrated 
in fig. 3, that the deflated algorithm has a flat scaling behaviour with respect to the 
quark mass. Moreover, the solver iteration numbers on the two lattices are nearly 
the same, which is very much in line with the expectation that the efficiency of the 
domain-decomposed deflation subspaces is independent of the lattice volume and 
that they provide a solution to the y 2 -problem. 

Contrary to the solver iteration numbers, the timings quoted in the last column of 
table 1 are sensitive to the time required for the application of the deflation projector 
Pl and thus to the average time needed for the solution of the little Dirac equation 
(see appendix A). The application of the projector actually consumed as much as 
25% of the total time on the small lattice and up to 30% on the big lattice. 
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6.4 Miscellaneous remarks 

(1) Partially quenched QCD. In the tests reported in the previous subsection, the 
deflation subspace was generated only once per gauge-field configuration, i.e. the 
same subspace was used at all values of the valence-quark mass considered. 

(2) Deflation overhead. The average time spent for the subspace generation was 150 
and 184 seconds, respectively, on the 48 x 24 3 and the 64 x 32 3 lattice. These figures 
include the time needed for the computation of the little Dirac operator (3.3 and 3.9 
seconds) . The computational effort required for the preparatory work thus becomes 
quickly negligible if several quark propagators are to be computed. 

(3) Solver stability. In the case of the deflated solver, the GCR iteration numbers 
Nqck tend to be very stable. The iteration numbers observed in the tests actually 
deviated by at most 1 from their average values, except at the smallest quark mass 
on the 64 x 32 3 lattice, where the maximal value of Nqcr ever seen was 27. 

(4) Acceleration of the HMC algorithm. The HMC simulation algorithm [29] requires 
the lattice Dirac equation to be solved at regular intervals along the trajectories in 
field space which lead from the current to the next configuration. Whether the use 
of low-mode deflation is profitable in this case depends on the quark mass and the 
precision requirements. 

On the lattices specified in subsect. 2.1, for example, an acceleration is achieved 
at hopping parameters K sea > 0.13625, if the relative solver tolerance is set to 10~ 7 
or less and if, say, 8 inverse iteration steps are used for the subspace generation. At 
these fairly small quark masses, the scaling behaviour of the HMC algorithm is then 
softened by nearly one power in the quark mass. 

In practice much larger speed-up factors can conceivably be obtained by updating 
the deflation subspace along the trajectories in field space rather than generating 
the subspace from scratch each time the Dirac equation must be solved. Moreover, 
starting from the exact factorization 

det£> = detAdetD (6.3) 

of the quark determinant, the HMC algorithm itself can perhaps be deflated too, 
in which case further accelerations and an improved stability of the algorithm will 
presumably be achieved. 
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7. Concluding remarks 

An important qualitative result of this paper is the demonstration that the low 
quark modes can be simultaneously deflated using local subspaces of low dimension. 
Some further clarification (an analytic proof of the local coherence of the low modes, 
for example) would certainly be welcome, but the numerical studies conducted so 
far leave little doubt that the construction does indeed provide a solution to the 
F 2 -problem. 

Variance reduction methods, such as low-mode averaging [4] and all-to-all propa- 
gator techniques [6,7], will probably be able to profit from these developments. The 
performance of the deflation-accelerated solver for the lattice Dirac equation dis- 
cussed in the previous section is, in any case, quite impressive, particularly so at the 
smallest quark masses considered. In many cases the computational effort required 
for the calculation of hadronic correlation functions is thus significantly reduced. 

The possible inclusion of deflation ideas in QCD simulation algorithms is an in- 
triguing perspective, since this may allow simulations close to the physical values of 
the light-quark masses to be performed with an effort not very much larger than the 
one required at a sea-quark mass equal to, say, a fourth of the physical strange-quark 
mass. 

I am indebted to Leonardo Giusti and Peter Weisz for critical comments on a first 
version of the paper. The gauge-field configurations used for the numerical studies 
reported in this paper were generated by the authors of ref. [15]. All computations 
were performed on a dedicated PC cluster at CERN and on a CRAY XT3 at the 
Swiss National Supercomputing Centre (CSCS). I am grateful to these institutions 
for providing the required computer resources. 



Appendix A. Solution of the little Dirac equation 

In practice the dimension N of the domain-decomposed deflation subspaces intro- 
duced in this paper tends to be so large that an exact solution of the little Dirac 
equation (6.1) is not a viable option. The iterative solver proposed here is based on 
even-odd preconditioning, global-mode deflation and the GCR algorithm. 
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Fig. 4. Support of the function D<f>{x) (black points) if 4>{x) is supported on the 
grey block in the centre of the figure. In particular, the matrix elements (A.l) vanish 
unless the blocks A and A' coincide or are nearest neighbours. 

A . 1 Computation of the little Dirac operator 

The block division of the lattice implies a decomposition of the little Dirac operator 
into N s x N s block matrices Baa', whose matrix elements are given by 

(B AA ,) kl = (4>£,D<f>f), k,l = l,...,N s . (A.l) 

Since the Wilson-Dirac operator has only nearest-neighbour hopping terms, most of 
these matrices vanish and a moment of thought reveals that the little Dirac operator 
actually couples nearest-neighbour blocks only (see fig. 4). 

The computation of the scalar products (A.l) is straightforward and requires a 
total effort proportional to the lattice volume times A^. Note, however, that the 
operations count tends to increase rapidly if lattice Dirac operators with hopping 
terms extending over two or more links are considered. 

A. 2 Even- odd preconditioning 

The even-odd preconditioning familiar from the full lattice Dirac operator can also 
be applied to the little Dirac operator in its block form if there is an even number 
of blocks in each direction (which is here assumed to be the case). If the so-called 
symmetric preconditioning is chosen [18], the block matrices representing the pre- 
conditioned operator on even blocks A, A' are given by 

Baa' = £>aa' - ^(-Baa)" 1 BAnjBnn) -1 Bqa' , (A. 2) 

n 

where the sum extends over the common neighbours f2 of the blocks A and A'. 
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The matrices (A. 2) do not need to be stored in the memory of the computer, be- 
cause the action of the preconditioned operator on a complex field can be computed 
in two steps, first hopping from the even to the odd blocks and then back to the 
even blocks. Some work can however be saved by storing the matrices (-Baa) _1 .Baa' 
for all nearest-neighbour pairs A, A' of blocks. 

On physically small blocks A, the diagonal block matrices Baa tend to be safely 
invertible, but the program should check this and return to the original system if 
an ill-conditioned matrix is encountered (this never happened in the tests reported 
in this paper). 

A. 3 Global-mode deflation 

As explained in subsect. 4.2, the basis fields 4>f{x) on the blocks A are obtained 
starting from a set of global fields ipi(x), I = 1, . . . , N s . The latter span a subspacc 
in the generated deflation subspace which may be used to deflate the little Dirac 
operator. Actually only the components of the global fields on the even blocks are 
used to build this "little deflation subspace" , because the little Dirac equation is to 
be deflated in its even-odd preconditioned form. 

The equation is deflated following the general procedures described in sect. 3. One 
simply has to replace the full Dirac operator by the even-odd preconditioned little 
Dirac operator and the quark fields by complex fields with N/2 components. Note 
that the "little little Dirac operator" is an N s x N s matrix that can be inverted to 
machine precision with a negligible effort. 

Global-mode deflation is straightforward to implement and tends to reduce the 
condition number of the little system quite significantly (by about a factor 3 in the 
cases studied so far). 

A. 4 Solver performance 

Similarly to the full system, the deflated preconditioned little equation can be solved 
using the GCR algorithm. Tests of the complete solver were then performed using 
the same subspaces as in subsect. 6.3. In particular, the number of GCR iterations 
Aqcr an d the time t needed to solve the little equation to a relative precision of 
10 -12 were determined and are quoted in columns 3 and 4 of table 2. 

The dependence of these figures on the valence quark mass and the lattice volume 
is noticeable, but one can also see that the solver iteration numbers increase only 
relatively slowly towards the smaller quark masses. In practice all these variations 
are not too important, because the solution of the little system eventually consumes 
only a fraction of the time spent for the solution of the full system. 
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Table 2. GCR iteration numbers and time * needed for the solution of the little system 



Lattice 


^val 


AT 

> GCR 


t [secj 


AT 


48 x 24 3 


0.13550 


84 


0.26 


24 




0.13590 


105 


0.32 


30 




0.13610 


120 


0.37 


34 




0.13625 


136 


0.42 


38 




0.13635 


150 


0.46 


42 


64 x 32 3 


0.13550 


94 


0.37 


27 




0.13590 


126 


0.49 


36 




0.13610 


154 


0.60 


44 




0.13625 


188 


0.73 


54 




0.13635 


220 


0.85 


62 



* Using 24 and 64 processors, respectively, in the case of the 48 x 24 3 and the 64 x 32 3 lattice 



A . 5 Using adapted precision 

It is still worth including another improvement, however, which exploits the fact 
that the outer GCR algorithm (the one that solves the full system) is restarted from 
time to time, usually when the dimension of the generated Krylov space reaches the 
specified maximal value. Before each restart, the current residue is recomputed with 
high precision so that any inaccuracies which may have accumulated during the last 
cycle do not propagate to the next cycle. 

For this reason it is permissible to solve the little Dirac equation to low precision 
inside the cycles of the outer algorithm. In the tests reported in subsect. 6.3, for 
example, the required relative tolerances were set to 10~ 6 and 10~ 12 , respectively, 
inside and outside the cycles of the algorithm. The average solver iteration numbers 
are then practically reduced by a factor 2. 

They can actually be reduced even further by adapting the precision as one pro- 
ceeds from one Krylov vector to the next within a cycle. This is possible because the 
GCR algorithm operates directly on the minimal residuals in the generated Krylov 
spaces. Their magnitude decreases monotonically and need to be computed essen- 
tially only to a fixed decimal precision. The required precision for the solution of 
the little system can therefore be reduced in proportion to the norm of the quark 
fields on which the deflation projector Pl acts. 
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Once all these improvements are installed, the average iteration numbers Nqcr 
required for the solution of the little system in the course of the cycles of the outer 
algorithm are reduced to the figures quoted in the last column of table 2. At the 
smallest quark mass on the 64 x 32 3 lattice, for example, the time spent for the 
solution of the little system sums up to about 6 seconds, i.e. about 19% of the total 
time needed for the solution of the full system. 
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